Personal Loan Campaign¶

Problem Statement¶

Context¶

AllLife Bank is a US bank with a majority of liability customers (depositors); the number of asset customers (borrowers) is quite small. The bank is interested in rapidly expanding its borrower base to bring in more loan business and thereby earn more through interest on loans. In particular, management wants to explore ways of converting its liability customers into personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. The retail marketing department therefore wants to devise campaigns with better target marketing to increase the success ratio.

Dataset¶

The dataset for analysis includes the following columns:

ID, Age, Experience, Income, ZIPCode, Family, CCAvg, Education, Mortgage, Personal_Loan, Securities_Account, CD_Account, Online, CreditCard

Key Questions to be answered¶

  • How many rows and columns are present in the data?
  • What are the datatypes of the different columns in the dataset?
  • Are there any missing values in the data? If yes, how will they be treated?
  • What insights can be drawn from statistical summaries of the data?
  • Which customers should be targeted so that the sale of personal loans increases?
  • What is the relationship between Personal_Loan and Income?

Objective¶

  • To predict whether a liability customer will buy a personal loan.
  • To understand which customer attributes are most significant in driving purchases.
  • To identify which segment of customers to target more.

Data Dictionary¶

  1. ID - Unique customer ID

  2. Age - Customer's age in completed years

  3. Experience - Number of years of professional experience

  4. Income - Annual income of the customer (In thousand dollars)

  5. ZIPCode - Home address zip code

  6. Family - The family size of the customer

  7. CCAvg - Average spending on credit cards per month (In thousand dollars)

  8. Education - Education level

    -    1 Represents Undergraduate,
    -    2 Represents Graduate,
    -    3 Represents Advanced / Professional
  9. Mortgage - Value of house mortgage if any (In thousand dollars)

  10. Personal_Loan - Did this customer accept the personal loan offered in the last campaign?

    • 0 Represents customer did not accept the personal loan offered in the last campaign
    • 1 Represents customer accepted the personal loan offered in the last campaign
  11. Securities_Account - Does the customer have a securities account with the bank?

    • 0 Represents customer does not have a securities account with the bank
    • 1 Represents customer has a securities account with the bank
  12. CD_Account - Does the customer have a certificate of deposit (CD) account with the bank?

    • 0 Represents customer does not have a certificate of deposit account with the bank
    • 1 Represents customer has a certificate of deposit account with the bank
  13. Online - Does the customer use Internet Banking facilities?

    • 0 Represents customer does not use Internet Banking facilities
    • 1 Represents customer uses Internet Banking facilities
  14. CreditCard - Does the customer use a credit card issued by any other bank (excluding AllLife Bank)?

    • 0 Represents customer does not use a credit card issued by another bank
    • 1 Represents customer uses a credit card issued by another bank

Importing the required libraries¶

In [ ]:
# libraries for data and calculations
import pandas as pd
import numpy as np
import math

# for handling zipcodes
!pip install pyzipcode
from pyzipcode import ZipCodeDatabase

# libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for handling imbalanced data
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

# to split data into training and test sets
from sklearn.model_selection import train_test_split

# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# to tune different models
from sklearn.model_selection import GridSearchCV

# to compute classification metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
)

# to display plots inline
%matplotlib inline

# to suppress warnings
import warnings
warnings.filterwarnings('ignore')
Requirement already satisfied: pyzipcode in /usr/local/lib/python3.11/dist-packages (3.0.1)

Understanding the structure of the data¶

In [ ]:
# mount Google Drive (needed when running on Google Colab)
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
# code here to read the data
df = pd.read_csv('/content/drive/MyDrive/Great learning/Machine Learning/Loan_Modelling/Loan_Modelling.csv')
#df = pd.read_csv('Loan_Modelling.csv')
data = df.copy()
In [ ]:
# to view the first 5 rows
data.head()
In [ ]:
# Rows and columns in the data
print(data.duplicated().sum())
print(data.shape)
0
(5000, 14)
Observations:¶

There are 14 columns and 5000 rows in the data with no duplicates.

In [ ]:
# the datatypes of the different columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
Observations:¶
  • ID is integer type with no null values.
  • Age is integer type with no null values.
  • Experience is integer type with no null values.
  • Income is integer type with no null values.
  • ZIPCode is an integer type with no null values. It is a categorical variable.
  • Family is integer type with no null values. It is a categorical variable.
  • CCAvg is a float type with no null values.
  • Education is integer type with no null values. It is a categorical variable and is encoded by default.
  • Mortgage is integer type with no null values.
  • Personal_Loan is an integer type with no null values. It is a categorical variable and is encoded by default.
  • Securities_Account is integer type with no null values. It is a categorical variable and is encoded by default.
  • CD_Account is an integer type with no null values. It is a categorical variable and is encoded by default.
  • Online is an integer type with no null values. It is a categorical variable and is encoded by default.
  • CreditCard is an integer type with no null values. It is a categorical variable and is encoded by default.
  • There are 5 numerical variables, 8 categorical variables, and 1 ID variable which is unique for each record.
  • There are no null values in the data.

Check the statistical summary of the data.¶

In [ ]:
# check statistical summary of data
data.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Observations:¶

  • Average age of customers is 45.33 years.
  • Average experience of customers is 20.10 years. Minimum experience is -3 years, which is invalid (experience cannot be negative) and needs to be treated.
  • Average income is 73.77 thousand dollars. Minimum income is 8 and maximum income is 224 thousand dollars, indicating the distribution of income is right skewed.
  • Minimum family size is 1 and maximum family size is 4.
  • Average credit card spending is 1.93 thousand dollars per month.
  • Average house mortgage value is 56.49 thousand dollars.
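The invalid negative Experience values flagged above can be counted before deciding on a treatment. A minimal sketch on a toy frame (the column name matches the dataset; the values are hypothetical):

```python
import pandas as pd

# toy frame standing in for the loaded dataset (values are hypothetical)
toy = pd.DataFrame({"Experience": [-3, -1, 0, 5, 20, 43]})

# boolean mask of invalid rows, then count them
n_negative = (toy["Experience"] < 0).sum()
print(n_negative)  # 2
```

The same expression applied to the full dataset gives the number of rows that will need treatment.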

Data Preprocessing¶

  • There is no missing data; however, the variable Experience has negative values, which need to be treated as missing data.
  • There are no duplicate records in the data.
  • Zip codes should be converted to states and cities to get more insights from the data.
In [ ]:
# replace negative values in Experience with NaN (treat as missing)
data['Experience'] = data['Experience'].apply(lambda x: np.nan if x<0 else x)
In [ ]:
# Let's find outliers in the data
# defining the list of numerical features to plot
num_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']

# plotting a boxplot for each numerical feature
plt.figure(figsize=(15, 10))
for i, feature in enumerate(num_features):
    plt.subplot(3, 2, i + 1)    # assign a subplot in the main plot
    sns.boxplot(data=data, x=feature)
plt.tight_layout()
plt.show()
[Figure: boxplots of Age, Experience, Income, CCAvg, Mortgage]
  • Income, CCAvg and Mortgage have outliers; however, if someone has an income of 190 thousand dollars a year, an average monthly credit card spending of up to 10 thousand dollars seems plausible. Outlier treatment is not needed here.
  • The median home price in California is 829 thousand dollars, according to a report from the California Association of Realtors, so the mortgage outliers do seem to be true data points well within range. Outlier treatment is not needed here.
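The outlier judgement above can also be made quantitative with the same 1.5×IQR rule a boxplot whisker uses. A minimal sketch on hypothetical income values (in thousand dollars):

```python
import pandas as pd

# hypothetical annual incomes in thousand dollars
income = pd.Series([8, 39, 45, 55, 60, 64, 70, 98, 210, 224])

# 1.5 * IQR fences, matching the boxplot whisker rule
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# values outside the fences are flagged as outliers
outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())  # [210, 224]
```

Running the same fences over each numerical column of the dataset would give exact outlier counts to back the visual reading.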
In [ ]:
# convert zip code to city and state
zcdb = ZipCodeDatabase()
def get_city_state(zip_code):
    try:
        zip_info = zcdb[zip_code]
        return zip_info.city, zip_info.state
    except KeyError:
        return None, None
data['City'], data['State'] = zip(*data['ZIPCode'].apply(get_city_state))
data.head()
In [ ]:
data.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 4948.0 20.331043 11.311973 0.0 10.75 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
In [ ]:
data.describe(include='object').T
Out[ ]:
count unique top freq
City 4966 242 Los Angeles 375
State 4966 1 CA 4966
  • The 52 negative Experience values are replaced with NaN and treated as missing (the non-null count of Experience drops to 4948).
  • Zip codes are converted into cities and states. All customers in the data belong to 242 different cities in the state of California.
  • 375 customers are from Los Angeles city, which is the maximum number of customers from a single city.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Age, Experience, Income, CCAvg, Mortgage¶
In [ ]:
# defining the figure size
plt.figure(figsize=(15, 10))

# plotting the histogram for each numerical feature
for i, feature in enumerate(num_features):
    plt.subplot(3, 2, i+1)    # assign a subplot in the main plot
    sns.histplot(data=data, x=feature, kde=True)    # plot the histogram
plt.tight_layout();   # to add spacing between plots
plt.show()
[Figure: histograms of Age, Experience, Income, CCAvg, Mortgage]
Observations:¶
  • Distribution of Age is uniform, indicating the number of customers is roughly the same across ages 30 to 70.
  • Distribution of Experience is also uniform, indicating the number of customers is roughly the same across 3 to 37 years of experience.
  • Distribution of Income is slightly right skewed, indicating more customers earn between roughly 0 and 99 thousand dollars.
  • Distribution of CCAvg is right skewed, with a sharp drop after 1.9 thousand dollars, indicating most customers spend up to 1.9 thousand dollars per month on average using credit cards.
  • Distribution of Mortgage is right skewed; more than 3000 customers have a house mortgage value of zero, indicating most customers do not have a house mortgage.
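The skew described above can be quantified with pandas' sample skewness, where a positive value confirms a right tail and a value near zero indicates symmetry. A minimal sketch on a hypothetical series:

```python
import pandas as pd

# hypothetical right-skewed values: a few large observations pull the tail right
s = pd.Series([10, 20, 30, 40, 50, 60, 300])

# positive skewness => right skewed; near zero => roughly symmetric
print(s.skew() > 0)  # True
```

Applying `.skew()` to Income, CCAvg and Mortgage would put numbers behind the "right skewed" readings of the histograms.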
In [ ]:
# defining the figure size
plt.figure(figsize=(15, 10))

# defining the list of numerical features to plot
num_features = ['Age', 'Experience','Income', 'CCAvg', 'Mortgage']

# plotting a boxplot for each numerical feature
for i, feature in enumerate(num_features):
    plt.subplot(3, 2, i+1)    # assign a subplot in the main plot
    sns.boxplot(data=data, x=feature)    # plot the boxplot
plt.tight_layout();   # to add spacing between plots
plt.show()
[Figure: boxplots of Age, Experience, Income, CCAvg, Mortgage]
Observations:¶
  • 50% of customers are aged between 35 and 55 years, with no outliers.
  • 50% of customers have around 10 to 30 years of experience, with no outliers.
  • There are a good number of outliers in Income, indicating a good number of customers have very high income.
  • 50% of customers have income between 40 and 100 thousand dollars.
  • There are a lot of outliers in the CCAvg variable, indicating a good number of customers spend more than 5 thousand dollars monthly using credit cards.
  • 50% of customers spend around 0.7 to 2.3 thousand dollars per month using credit cards.
  • There are a lot of outliers in the Mortgage variable, indicating a good number of customers have a house mortgage value of more than 250 thousand dollars.
  • 50% of customers have a house mortgage value of around 0 to 100 thousand dollars.
  • Income, CCAvg and Mortgage have outliers; however, if someone has an income of 190 thousand dollars a year, an average monthly credit card spending of up to 10 thousand dollars seems plausible. Outlier treatment is not needed here.
  • The median home price in California is 829 thousand dollars, according to a report from the California Association of Realtors, so the mortgage outliers do seem to be true data points well within range. Outlier treatment is not needed here.
Family, Education, Personal_Loan, Securities_Account, CD_Account, Online, CreditCard¶
In [ ]:
# defining the figure size
plt.figure(figsize=(15, 10))

# defining the list of categorical features to plot
cat_features = ['Family',
       'Education', 'Personal_Loan', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard']

# plotting a countplot for each categorical feature
for i, feature in enumerate(cat_features):
    plt.subplot(3, 3, i+1)    # assign a subplot in the main plot
    sns.countplot(data=data, x=feature)    # plot the countplot
plt.tight_layout();   # to add spacing between plots
plt.show()
[Figure: countplots of Family, Education, Personal_Loan, Securities_Account, CD_Account, Online, CreditCard]
Observations:¶
  • More than 1400 customers have a family size of 1, followed by family sizes 2 and 4 with more than 1200 customers each.
  • Around 1000 customers have a family size of 3.
  • More than 2000 customers are undergraduates, followed by more than 1250 each of graduates and Advanced/Professional customers, indicating most customers are undergraduates.
  • Fewer than 500 customers accepted the personal loan offer in the last campaign while more than 4000 did not, indicating highly imbalanced data with respect to Personal_Loan.
  • Around 500 customers have a securities account with the bank while more than 4000 do not.
  • Around 200 customers have a certificate of deposit account with the bank while more than 4000 do not.
  • More than 2500 customers use online banking while around 2000 do not, indicating online banking is preferred by more than half of the customers.
  • Around 1500 customers use credit cards issued by other banks while around 3500 do not, indicating most customers prefer credit cards issued by AllLife Bank.
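The class imbalance noted above for Personal_Loan (480 acceptors out of 5000, i.e. 9.6%) can be quantified directly; a minimal sketch with the standard library's Counter on toy labels built to the same ratio. The RandomOverSampler imported at the top of the notebook is one way such imbalance could later be handled before modelling.

```python
from collections import Counter

# toy target standing in for Personal_Loan: 48 acceptors out of 500 (same 9.6% rate)
y = [1] * 48 + [0] * 452
counts = Counter(y)

# proportion of the positive (loan-accepting) class
positive_rate = counts[1] / sum(counts.values())
print(positive_rate)  # 0.096
```

A positive rate this low means plain accuracy would be a misleading metric later; recall or F1 on the positive class is a better target.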
Zip Codes¶
In [ ]:
# Step 1: Calculate count of customers in each city
customer_city_counts = data['City'].value_counts()
customer_city_counts
Out[ ]:
count
City
Los Angeles 375
San Diego 269
San Francisco 257
Berkeley 241
Sacramento 148
... ...
Sierra Madre 1
Sausalito 1
Ladera Ranch 1
Tahoe City 1
Stinson Beach 1

242 rows × 1 columns


In [ ]:
# customer counts per city range from 1 to 375; create bins accordingly
# Step 2: Define bins based on the counts, using fixed edges
city_bin_edges = [0, 10, 100, 200, float('inf')]
city_labels = ['Low', 'Medium', 'High', 'Very High']
city_bins = pd.cut(customer_city_counts, bins=city_bin_edges, labels=city_labels, right=False)

# Step 3: Create a new DataFrame to map original categories to the new bins
city_groups = pd.DataFrame({'Category': customer_city_counts.index, 'Counts': customer_city_counts.values, 'City_Category': city_bins.values})

# Step 4: Merge this with your original data, if needed
city_data = data.merge(city_groups[['Category', 'City_Category']], left_on='City', right_on='Category', how='left').drop('Category', axis=1)

# Display the updated DataFrame
city_data.head(2)
In [ ]:
# find out how many cities are in each category
city_bin_counts = city_bins.value_counts()
city_bin_counts.index.name = 'City_Category'
city_bin_counts.name = 'count of cities'
city_bin_counts
Out[ ]:
count of cities
City_Category
Low 127
Medium 105
High 6
Very High 4

In [ ]:
# find the 20 cities with the highest number of customers
sr_imp_city = data['City'].value_counts().sort_values(ascending=False).head(20)
df_imp_city_counts = sr_imp_city.reset_index()
df_imp_city_counts.columns = ['City', 'Count']

df_bin_city_counts = city_bin_counts.reset_index()
df_bin_city_counts.columns = ['City category', 'Count']

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(9, 6)) # 1 row, 2 columns
sns.barplot(x='City category', y='Count', data=df_bin_city_counts, ax=axes[0])
axes[0].set_title('Countplot of City category with Count')
axes[0].set_xlabel('City category')
axes[0].set_ylabel('Count')

sns.barplot(x='City', y='Count', data=df_imp_city_counts, ax=axes[1])
axes[1].set_title('Countplot of 20 cities with highest customers')
axes[1].set_xlabel('Cities')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=90)


plt.tight_layout() # Adjust layout to prevent overlapping
plt.show()
[Figure: bar plots of city category counts and the top 20 cities by customer count]
  • There are 127 cities in the low category (1 to 9 customers).
  • There are 105 cities in the medium category (10 to 99 customers).
  • There are 6 cities in the high category (100 to 199 customers).
  • There are 4 cities in the very high category (200 or more customers).
  • Los Angeles is on top with more than 350 customers, followed by San Diego, San Francisco, and Berkeley.

Observations of univariate analysis:¶

  • Distribution of Age is uniform, indicating the number of customers is roughly the same across ages 30 to 70.
  • Distribution of Experience is also uniform, indicating the number of customers is roughly the same across 3 to 37 years of experience.
  • Distribution of Income is slightly right skewed, indicating more customers earn between roughly 0 and 99 thousand dollars.
  • Distribution of CCAvg is right skewed, with a sharp drop after 1.9 thousand dollars, indicating most customers spend up to 1.9 thousand dollars per month on average using credit cards.
  • Distribution of Mortgage is right skewed; more than 3000 customers have a house mortgage value of zero, indicating most customers do not have a house mortgage.
  • 50% of customers are aged between 35 and 55 years, with no outliers.
  • 50% of customers have around 10 to 30 years of experience, with no outliers.
  • There are a good number of outliers in Income, indicating a good number of customers have very high income.
  • 50% of customers have income between 40 and 100 thousand dollars.
  • There are a lot of outliers in the CCAvg variable, indicating a good number of customers spend more than 5 thousand dollars monthly using credit cards.
  • 50% of customers spend around 0.7 to 2.3 thousand dollars per month using credit cards.
  • There are a lot of outliers in the Mortgage variable, indicating a good number of customers have a house mortgage value of more than 250 thousand dollars.
  • 50% of customers have a house mortgage value of around 0 to 100 thousand dollars.
  • Income, CCAvg and Mortgage have outliers; however, if someone has an income of 190 thousand dollars a year, an average monthly credit card spending of up to 10 thousand dollars seems plausible. Outlier treatment is not needed here.
  • The median home price in California is 829 thousand dollars, according to a report from the California Association of Realtors, so the mortgage outliers do seem to be true data points well within range. Outlier treatment is not needed here.
  • More than 1400 customers have a family size of 1, followed by family sizes 2 and 4 with more than 1200 customers each.
  • Around 1000 customers have a family size of 3.
  • More than 2000 customers are undergraduates, followed by more than 1250 each of graduates and Advanced/Professional customers, indicating most customers are undergraduates.
  • Fewer than 500 customers accepted the personal loan offer in the last campaign while more than 4000 did not, indicating highly imbalanced data with respect to Personal_Loan.
  • Around 500 customers have a securities account with the bank while more than 4000 do not.
  • Around 200 customers have a certificate of deposit account with the bank while more than 4000 do not.
  • More than 2500 customers use online banking while around 2000 do not, indicating online banking is preferred by more than half of the customers.
  • Around 1500 customers use credit cards issued by other banks while around 3500 do not, indicating most customers prefer credit cards issued by AllLife Bank.
  • Los Angeles is on top with more than 350 customers, followed by San Diego, San Francisco, and Berkeley.

Multivariate Analysis¶

Numeric variables with each other along with Personal_Loan¶

In [ ]:
# defining the size of the plot
plt.figure(figsize=(5, 4))

# plotting the heatmap for correlation
sns.heatmap(data[num_features].corr(),annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
[Figure: correlation heatmap of numerical features]
In [ ]:
# Scatter plot matrix (pairplot manages its own figure, so no plt.figure call is needed)
sns.pairplot(data, vars=num_features, hue='Personal_Loan', diag_kind='kde');
[Figure: pairplot of numerical features colored by Personal_Loan]
Observations:¶
  • Age and experience are highly positively correlated to each other, indicating as Age increases Experience increases as well.
  • Income and CCAvg are positively correlated with each other (correlation 0.65), indicating that as income increases, average credit card spending per month increases as well.
  • Personal loan offers were accepted by some customers with age above 45 years and experience above 18 years.
  • Considering Age and income together, customers who accepted personal loan offers have income higher than 100 thousand dollars and are from all age groups.
  • Considering Age and CCAvg, customers who accept personal loan offers spend more than around 2.5 thousand dollars per month using credit cards and are from all age groups.
  • Considering Age and mortgage customers who accepted personal loan offers have house mortgage values more than around 200 thousand dollars and are from all age groups.
In [ ]:
# Income vs Personal_Loan (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=data, x='Personal_Loan', y='Income')
plt.title('Income vs Personal_Loan (Boxplot)')
plt.show()
[Figure: boxplot of Income by Personal_Loan]
  • 50% of customers who accepted personal loan offers have income of more than 100 thousand dollars. However, there are many high-income outliers among customers who did not accept, suggesting interest in personal loans decreases among customers with income above around 170 thousand dollars.
In [ ]:
# Mortgage vs Personal_Loan (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=data, x='Personal_Loan', y='Mortgage')
plt.title('Mortgage vs Personal_Loan (Boxplot)')
plt.show()
[Figure: boxplot of Mortgage by Personal_Loan]
  • There are outliers in both cases of personal loan with respect to high house mortgage value customers, indicating these customers show less interest in personal loans.
  • Also there is mixed response with respect to low house mortgage value as well.
In [ ]:
# CCAvg vs Personal_Loan (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=data, x='Personal_Loan', y='CCAvg')
plt.title('CCAvg vs Personal_Loan (Boxplot)')
plt.show()
[Figure: boxplot of CCAvg by Personal_Loan]
  • 50% of customers who accepted personal loans have average credit card spending per month between around 2.5 to 5.3 thousand dollars.
  • 50% of customers who did not accept personal loan have average credit card spending per month between around 0.5 to 2.2 thousand dollars.
  • There are many outliers in customers who did not accept personal loan offers who also have high credit card spending per month that is more than around 4.3 thousand dollars.

categorical variables with each other along with Personal_Loan¶

In [ ]:
# crosstab of categorical columns Family and Personal_Loan
crosstb = pd.crosstab(data['Family'], data['Personal_Loan'], normalize='index')
print(crosstb)
crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Personal_Loan         0         1
Family                           
1              0.927310  0.072690
2              0.918210  0.081790
3              0.868317  0.131683
4              0.890344  0.109656
[Figure: stacked bar plot of Personal_Loan proportions by Family]
  • Customers with family size 3 and 4 (big family) are more likely to be interested in personal loans than customers with small families.
  • Customers with family size 1 are more likely to not accept personal loan offers.
In [ ]:
# crosstab of categorical columns Education and Personal_Loan
crosstb = pd.crosstab(data['Education'], data['Personal_Loan'], normalize='index')
print(crosstb)
crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Personal_Loan         0         1
Education                        
1              0.955630  0.044370
2              0.870278  0.129722
3              0.863424  0.136576
[Figure: stacked bar plot of Personal_Loan proportions by Education]
  • Customers with Graduate and Advanced / Professional degrees are more likely to accept personal loan offers than customers who are undergraduates.
In [ ]:
# crosstab of categorical columns Securities_Account and Personal_Loan
crosstb = pd.crosstab(data['Securities_Account'], data['Personal_Loan'], normalize='index')
print(crosstb)
crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Personal_Loan              0         1
Securities_Account                    
0                   0.906208  0.093792
1                   0.885057  0.114943
[Figure: stacked bar plot of Personal_Loan proportions by Securities_Account]
  • 11% of customers who have a securities account with the bank accepted personal loan offers.
  • 9% of customers who do not have a securities account with the bank accepted personal loan offers.
In [ ]:
# crosstab of categorical columns CD_Account  and Personal_Loan
crosstb = pd.crosstab(data['CD_Account'], data['Personal_Loan'], normalize='index')
print(crosstb)
crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Personal_Loan         0         1
CD_Account                       
0              0.927629  0.072371
1              0.536424  0.463576
[Figure: stacked bar plot of Personal_Loan proportions by CD_Account]
  • 46% of customers who have a CD account also accepted personal loan offers indicating a fair response if the marketing department targets these customers.
  • Only 7% of customers with no CD account accepted personal loan offers.
In [ ]:
# crosstab of categorical columns Online  and Personal_Loan
crosstb = pd.crosstab(data['Online'], data['Personal_Loan'], normalize='index')
print(crosstb)
crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Personal_Loan        0        1
Online                         
0              0.90625  0.09375
1              0.90248  0.09752
[Figure: stacked bar plot of Personal_Loan proportions by Online]
  • Acceptance of personal loan offers follows a similar trend whether or not a customer uses online banking, indicating a weak relationship between acceptance of a personal loan offer and use of the online platform.
In [ ]:
# crosstab of categorical columns CreditCard  and Personal_Loan
crosstb = pd.crosstab(data['CreditCard'], data['Personal_Loan'], normalize='index')
print(crosstb)
crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Personal_Loan         0         1
CreditCard                       
0              0.904533  0.095467
1              0.902721  0.097279
[Figure: stacked bar plot of Personal_Loan proportions by CreditCard]
  • Acceptance of personal loan offers follows a similar trend whether or not a customer uses a credit card from another bank, indicating a weak relationship between acceptance of a personal loan offer and use of a credit card from another bank.
In [ ]:
# crosstab of categorical columns City  and Personal_Loan
crosstb = pd.crosstab(city_data['City_Category'], data['Personal_Loan'], normalize='index')
print(crosstb)
crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Personal_Loan         0         1
City_Category                    
Low            0.916393  0.083607
Medium         0.898908  0.101092
High           0.904184  0.095816
Very High      0.908056  0.091944
  • Acceptance rates are similar (8.4% to 10.1%) across city categories, indicating a weak relationship between loan acceptance and the category of city a customer belongs to.
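These "weak relationship" readings can be checked more formally with a chi-square test of independence. Below is a minimal sketch on a hypothetical contingency table whose row proportions match the Online crosstab above; the counts themselves are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = Online (0/1), columns = Personal_Loan (0/1);
# row proportions mirror the crosstab above (9.4% vs. 9.8% acceptance)
observed = np.array([[1160, 120],
                     [2352, 254]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
# a large p-value means we cannot reject independence, consistent
# with a weak relationship between the two variables
```

The same test could be applied to the CreditCard and City_Category crosstabs.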
In [ ]:
# Income vs CD_Account  (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=data, x='CD_Account', y='Income')
plt.title('CD_Account vs Income (Boxplot)')
plt.show()
No description has been provided for this image
  • Among customers without a certificate of deposit account, the middle 50% of incomes falls between around 40 and 95 thousand dollars; however, there are many outliers with income above 170 thousand dollars.
  • Among customers with a certificate of deposit account, the middle 50% of incomes falls between around 52 and 150 thousand dollars.
In [ ]:
# CD_Account vs Mortgage  (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=data, x='CD_Account', y='Mortgage')
plt.title('CD_Account vs Mortgage (Boxplot)')
plt.show()
No description has been provided for this image
  • Among customers without a certificate of deposit account, the middle 50% of house mortgage values falls between 0 and around 100 thousand dollars, with many outliers above 200 thousand dollars.
  • Among customers with a certificate of deposit account, the middle 50% of house mortgage values falls between 0 and around 200 thousand dollars, with many outliers above 400 thousand dollars.
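The quartile ranges quoted in these boxplot observations can also be read off numerically with a grouped quantile computation. A minimal sketch on made-up data (column names follow the dataset; the values are illustrative only):

```python
import pandas as pd

# Toy sample reusing the dataset's column names; values are illustrative
df = pd.DataFrame({
    "CD_Account": [0, 0, 0, 0, 1, 1, 1, 1],
    "Income":     [40, 60, 80, 95, 52, 90, 120, 150],
})

# 25th/50th/75th percentiles of Income per CD_Account group;
# the box in a boxplot spans Q1 to Q3
quartiles = df.groupby("CD_Account")["Income"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```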
In [ ]:
# crosstab of categorical columns Family  and Education
crosstb = pd.crosstab(data['Family'], data['Education'], normalize='index')
print(crosstb)
barplot = crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Education         1         2         3
Family                                 
1          0.460598  0.221467  0.317935
2          0.506944  0.204475  0.288580
3          0.345545  0.379208  0.275248
4          0.337152  0.351064  0.311784
  • Among customers with family size 1 or 2, undergraduates form the largest share.
  • Customers with family sizes of 3 and 4 tend to be more educated than those with family size 1 or 2.
In [ ]:
# City_Category vs Income (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=city_data, x='City_Category', y='Income')
plt.title('City_Category vs Income (Boxplot)')
plt.show()
No description has been provided for this image
  • The median income of customers from low and medium category cities is slightly higher than that of customers from high and very high category cities.
In [ ]:
# City_Category vs Mortgage (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=city_data, x='City_Category', y='Mortgage')
plt.title('City_Category vs Mortgage (Boxplot)')
plt.show()
No description has been provided for this image
  • Mortgage shows a similar pattern across city categories: the middle 50% of house mortgage values lies between 0 and 100 thousand dollars in every category, and every category has outliers.
In [ ]:
# crosstab of categorical columns City Category  and Education
crosstb = pd.crosstab(city_data['City_Category'], data['Education'], normalize='index')
print(crosstb)
barplot = crosstb.plot.bar(stacked=True, figsize=(4, 3))
plt.show()
Education             1         2         3
City_Category                              
Low            0.436066  0.301639  0.262295
Medium         0.409624  0.279418  0.310958
High           0.426451  0.271255  0.302294
Very High      0.429072  0.274956  0.295972
  • The education mix is similar across city categories: roughly 40% undergraduates, 28% graduates, and 30% Advanced/Professional in each.
In [ ]:
# Income vs Education (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=city_data, x='Education', y='Income')
plt.title('Education vs Income (Boxplot)')
plt.show()
No description has been provided for this image
  • The median income of undergraduates is higher than that of graduates and Advanced/Professionals.
  • However, there are high-income outliers among graduates and Advanced/Professionals.
In [ ]:
# Education vs CCAvg (boxplot)
plt.figure(figsize=(7, 4))
sns.boxplot(data=city_data, x='Education', y='CCAvg')
plt.title('CCAvg vs Education (Boxplot)')
plt.show()
No description has been provided for this image
  • Undergraduates have a higher median CCAvg than graduates and Advanced/Professionals, indicating they spend more per month on their credit cards.

Observation of Multivariate Analysis¶

  • Age and experience are highly positively correlated to each other, indicating as Age increases Experience increases as well.
  • Income and CCAvg are positively correlated (correlation 0.65), indicating that average monthly credit card spending rises with income.
  • Personal loan offers were accepted by some customers with age above 45 years and experience above 18 years.
  • Considering Age and income together, customers who accepted personal loan offers have income higher than 100 thousand dollars and are from all age groups.
  • Considering Age and CCAvg, customers who accept personal loan offers spend more than around 2.5 thousand dollars per month using credit cards and are from all age groups.
  • Considering Age and mortgage customers who accepted personal loan offers have house mortgage values more than around 200 thousand dollars and are from all age groups.
  • 50% of customers who accepted personal loan offers have income above 100 thousand dollars. However, there are many high-income outliers among customers who did not accept, indicating interest in personal loans decreases among customers earning more than around 170 thousand dollars.
  • There are outliers in both cases of personal loan with respect to high house mortgage value customers, indicating these customers show less interest in personal loans.
  • Also there is mixed response with respect to low house mortgage value as well.
  • 50% of customers who accept personal loans have average credit card spending per month between around 2.5 to 5.3 thousand dollars.
  • 50% of customers who did not accept personal loans have average credit card spending per month between around 0.5 to 2.2 thousand dollars.
  • There are many outliers in customers who did not accept personal loan offers who also have high credit card spending per month that is more than around 4.3 thousand dollars.
  • Customers with family size 3 and 4 (big family) are more likely to be interested in personal loans than customers with small families.
  • Customers with family size 1 are more likely to not accept personal loan offers.
  • Customers with Graduate and Advanced / Professional degrees are more likely to accept personal loan offers than customers who are undergraduates.
  • 11% of customers who have a securities account with a bank have accepted personal loan offers.
  • 9% of customers who do not have a securities account with a bank have accepted personal loan offers.
  • 46% of customers who have a CD account also accepted personal loan offers, indicating a fair response if the marketing department targets these customers.
  • Only 7% of customers with no CD account accepted personal loan offers.
  • Personal loan acceptance rates are nearly identical (9.4% vs. 9.8%) whether or not a customer uses the online platform, indicating a weak relationship between loan acceptance and online platform use.
  • Acceptance rates are nearly identical (9.5% vs. 9.7%) whether or not a customer holds a credit card from another bank, indicating a weak relationship between loan acceptance and use of a credit card from another bank.
  • Acceptance rates are similar (8.4% to 10.1%) across city categories, indicating a weak relationship between loan acceptance and the category of city a customer belongs to.
  • Among customers without a certificate of deposit account, the middle 50% of incomes falls between around 40 and 95 thousand dollars; however, there are many outliers with income above 170 thousand dollars.
  • Among customers with a certificate of deposit account, the middle 50% of incomes falls between around 52 and 150 thousand dollars.
  • Among customers without a certificate of deposit account, the middle 50% of house mortgage values falls between 0 and around 100 thousand dollars, with many outliers above 200 thousand dollars.
  • Among customers with a certificate of deposit account, the middle 50% of house mortgage values falls between 0 and around 200 thousand dollars, with many outliers above 400 thousand dollars.
  • Among customers with family size 1 or 2, undergraduates form the largest share.
  • Customers with family sizes of 3 and 4 tend to be more educated than those with family size 1 or 2.
  • The median income of customers from low and medium category cities is slightly higher than that of customers from high and very high category cities.
  • Mortgage shows a similar pattern across city categories: the middle 50% of house mortgage values lies between 0 and 100 thousand dollars in every category, and every category has outliers.
  • The education mix is similar across city categories: roughly 40% undergraduates, 28% graduates, and 30% Advanced/Professional in each.
  • The median income of undergraduates is higher than that of graduates and Advanced/Professionals.
  • However, there are high-income outliers among graduates and Advanced/Professionals.
  • Undergraduates have a higher median CCAvg than graduates and Advanced/Professionals, indicating they spend more per month on their credit cards.

Modeling¶

Data Preparation for Modeling¶

  • Since scikit-learn decision trees cannot handle non-numeric categorical data directly, let's convert the City column into dummy variables. The State column can be dropped, as all customers are from California.
  • The ID and ZIPCode columns can also be dropped.
In [ ]:
# one-hot encode the City column; the State column can be dropped

data = pd.get_dummies(data, columns=['City'], drop_first=True)
data.drop(columns=['State','ID','ZIPCode'], inplace=True)
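As a quick illustration of what `drop_first=True` does, here is a minimal sketch on a toy column (the city names are placeholders, not values from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"City": ["A", "B", "C", "A"]})

# drop_first=True drops the dummy for the first category ("A");
# "A" is then represented implicitly by all-zeros, which avoids
# a redundant, perfectly collinear column
dummies = pd.get_dummies(toy, columns=["City"], drop_first=True)
print(dummies.columns.tolist())  # ['City_B', 'City_C']
```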

Model evaluation criterion¶

  • Since only 9% of customers accepted personal loan offers, the data is highly imbalanced.
  • Because the data is imbalanced, recall, precision, and the F1 score are better evaluation criteria than accuracy.
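To see why plain accuracy would mislead here, consider a classifier that always predicts "no loan" on a sample with roughly 9% positives; a minimal sketch with synthetic labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# synthetic labels: 9 positives out of 100, mimicking the imbalance
y_true = np.array([1] * 9 + [0] * 91)
y_pred = np.zeros_like(y_true)  # trivially predict "no loan" for everyone

print("accuracy:", accuracy_score(y_true, y_pred))       # 0.91, looks good
print("recall:", recall_score(y_true, y_pred))           # 0.0, misses every buyer
print("f1:", f1_score(y_true, y_pred, zero_division=0))  # 0.0
```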
In [ ]:
# split data into X independent and y target variables
X = data.drop('Personal_Loan', axis=1)
y = data['Personal_Loan']
In [ ]:
# Display class distribution before oversampling
#print("distribution of y before oversampling:", Counter(y))

# since y is imbalanced we should balance it by generating more minority records
# using oversampling for data
#oversample = RandomOverSampler(sampling_strategy='minority',random_state=42)
#X, y = oversample.fit_resample(X, y)

# split X and y into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Display class distribution after oversampling
#print("distribution of y_test after oversampling:",Counter(y_train))
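Since the target is imbalanced, a stratified split would keep the ~9% positive rate identical in the training and test sets; a minimal sketch on synthetic data (the notebook itself uses a plain random split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in: 10% positives out of 1000 rows
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```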
In [ ]:
# creating an instance of the decision tree model with default parameters
dtree1 = DecisionTreeClassifier(random_state=42)

# fitting the model to the training data
dtree1.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(random_state=42)
In [ ]:
# defining a function to compute different metrics to check performance of decision tree
def get_tree_performance(model, predictors, target):
    """
    Function to compute different metrics to check tree performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    # compute metrics
    recall = recall_score(target, pred)
    precision = precision_score(target, pred)
    f1 = f1_score(target, pred)

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
# plot confusion metrics
def plot_confusion_matrix(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # Predict the target values using the provided model and predictors
    y_pred = model.predict(predictors)

    # Compute the confusion matrix comparing the true target values with the predicted values
    cm = confusion_matrix(target, y_pred)

    # Create labels for each cell in the confusion matrix with both count and percentage
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    # print(labels)

    # Plot the confusion matrix as a heatmap with the labels
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True values")
    plt.xlabel("Predicted values by tree")
In [ ]:
plot_confusion_matrix(dtree1, X_train, y_train)
  • Since dtree1 was created with default parameters, the tree has grown fully to fit the training data.
  • As the confusion matrix shows, the error between actual and predicted values is 0%.
  • Let's look at the performance metrics.
In [ ]:
dtree1_train_perf = get_tree_performance(
    dtree1, X_train, y_train
)
dtree1_train_perf
Out[ ]:
Recall Precision F1
0 1.0 1.0 1.0
  • All performance metrics are 1, indicating the model has memorized the training data; let's check it on the test data.
In [ ]:
plot_confusion_matrix(dtree1, X_test, y_test)
  • The model makes a fair number of errors on the test data.
In [ ]:
dtree1_test_perf = get_tree_performance(
    dtree1, X_test, y_test
)
dtree1_test_perf
Out[ ]:
Recall Precision F1
0 0.872611 0.913333 0.892508
  • The large gap between training and test performance of the default decision tree indicates it has overfit and not generalized well.

Visualizing the Decision Tree¶

In [ ]:
# define a function to display decision tree
def display_decision_tree(data_X,dtree):
  """
    To display the decision tree
    data_X: data that has independent variables
    dtree: decision tree model
  """
  # list of feature names in X_train
  feature_names = list(data_X.columns)
  plt.figure(figsize=(20, 20))

  # plotting the decision tree
  out = tree.plot_tree(
    dtree,
    feature_names=feature_names,
    filled=True,                    # fill the nodes with colors based on class
    fontsize=9,
    node_ids=False,
    class_names=True,               # whether or not to display class names
  )

  # add arrows to the decision tree splits if they are missing
  for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")    # set arrow color to black
        arrow.set_linewidth(1)          # set arrow linewidth to 1

  # displaying the plot
  plt.show()


# define a function to print text of decision_tree
def print_decision_tree(data_X,dtree):
  """
    To print a text report showing the rules of a decision tree
    data_X: data that has independent variables
    dtree: decision tree model
  """
  # list of feature names in X_train
  feature_names = list(data_X.columns)

  print(
    tree.export_text(
        dtree,
        feature_names=feature_names,
        show_weights=True    # specify whether or not to show the weights associated with the model
    )
  )
In [ ]:
display_decision_tree(X_train,dtree1)
  • We can observe that this is a very complex tree.
In [ ]:
print_decision_tree(X_train,dtree1)
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2546.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- City_Santa Barbara <= 0.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- CCAvg <= 1.75
|   |   |   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 80.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage >  80.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |--- Experience >  31.50
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  1.75
|   |   |   |   |   |   |--- CCAvg <= 2.45
|   |   |   |   |   |   |   |--- City_San Francisco <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [18.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- City_San Francisco >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CCAvg >  2.45
|   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- City_Santa Barbara >  0.50
|   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 82.50
|   |   |   |   |--- City_Whittier <= 0.50
|   |   |   |   |   |--- Age <= 28.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  28.00
|   |   |   |   |   |   |--- City_La Jolla <= 0.50
|   |   |   |   |   |   |   |--- City_Glendale <= 0.50
|   |   |   |   |   |   |   |   |--- City_Santa Clara <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [73.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- City_Santa Clara >  0.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.70
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- City_Glendale >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- City_La Jolla >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- City_Whittier >  0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Income >  82.50
|   |   |   |   |--- CCAvg <= 3.95
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- CCAvg <= 3.85
|   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |--- Education <= 2.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |--- Education >  2.00
|   |   |   |   |   |   |   |   |   |--- City_San Francisco <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- City_San Francisco >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 89.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage >  89.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience >  3.50
|   |   |   |   |   |   |   |   |   |--- City_Moss Landing <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- City_Torrance <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- City_Torrance >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- City_Moss Landing >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CCAvg >  3.85
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |   |--- Income <= 84.50
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |--- City_Thousand Oaks <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- City_Thousand Oaks >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income >  84.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 11.00] class: 1
|   |   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  3.95
|   |   |   |   |   |--- City_Claremont <= 0.50
|   |   |   |   |   |   |--- City_Chula Vista <= 0.50
|   |   |   |   |   |   |   |--- City_Sunnyvale <= 0.50
|   |   |   |   |   |   |   |   |--- City_San Francisco <= 0.50
|   |   |   |   |   |   |   |   |   |--- City_Irvine <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- City_Los Angeles <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [47.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- City_Los Angeles >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- City_Irvine >  0.50
|   |   |   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- City_San Francisco >  0.50
|   |   |   |   |   |   |   |   |   |--- Age <= 52.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age >  52.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- City_Sunnyvale >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- City_Chula Vista >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- City_Claremont >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- City_Sacramento <= 0.50
|   |   |   |   |--- City_Oakland <= 0.50
|   |   |   |   |   |--- City_Milpitas <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 10.00] class: 1
|   |   |   |   |   |--- City_Milpitas >  0.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- City_Oakland >  0.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- City_Sacramento >  0.50
|   |   |   |   |--- weights: [1.00, 0.00] class: 0
|--- Income >  113.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [399.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 48.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 1.10
|   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |--- CCAvg >  1.10
|   |   |   |   |--- Age <= 51.00
|   |   |   |   |   |--- Mortgage <= 94.50
|   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |   |--- Mortgage >  94.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Age >  51.00
|   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 209.00] class: 1

  • Income, CCAvg, and Education seem to be the prominent features.
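This reading can be confirmed quantitatively with the fitted tree's `feature_importances_` attribute. Below is a self-contained sketch on synthetic data; on the real model, the equivalent would be `pd.Series(dtree1.feature_importances_, index=X_train.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# synthetic data in which only "Income" actually drives the target
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["Income", "Age", "Online"])
y = (X["Income"] > 0.5).astype(int)

model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Gini importances sum to 1; larger values mean more influential splits
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```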

Decision Tree (Pre-pruning)¶

In [ ]:
# define the parameters of the tree to iterate over
max_depth_values = np.arange(4, 10, 1)
max_leaf_nodes_values = np.arange(10, 51, 10)
min_samples_split_values = np.arange(10, 51, 10)

# initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')

# iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                random_state=42
            )

            # fit the model to the training data
            estimator.fit(X_train, y_train)

            # make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # calculate F1 scores for training and test sets
            train_f1_score = f1_score(y_train, y_train_pred)
            test_f1_score = f1_score(y_test, y_test_pred)

            # calculate the absolute difference between training and test F1 scores
            score_diff = abs(train_f1_score - test_f1_score)

            # update the best estimator and best score if the current one has a smaller score difference
            if score_diff < best_score_diff:
                best_score_diff = score_diff
                best_estimator = estimator
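The manual triple loop above can also be written with scikit-learn's `GridSearchCV`, which selects parameters by cross-validated score rather than by the train/test gap. A hedged sketch on synthetic data, reusing the same parameter ranges:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

param_grid = {
    "max_depth": np.arange(4, 10, 1),
    "max_leaf_nodes": np.arange(10, 51, 10),
    "min_samples_split": np.arange(10, 51, 10),
}

# 5-fold cross-validation, scored with F1 (suited to imbalanced data)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, scoring="f1", cv=5
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```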
In [ ]:
# creating an instance of the best model
dtree2 = best_estimator

# fitting the best model to the training data
dtree2.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(max_depth=np.int64(5), max_leaf_nodes=np.int64(20),
                       min_samples_split=np.int64(20), random_state=42)
  • dtree2 uses max_depth=5, max_leaf_nodes=20, and min_samples_split=20, the best parameter combination found for the pre-pruned tree; let's see its performance.
In [ ]:
plot_confusion_matrix(dtree2, X_train, y_train)
In [ ]:
# find performance metrics of training data of pre pruned tree
dtree2_train_perf = get_tree_performance(
    dtree2, X_train, y_train
)
dtree2_train_perf
Out[ ]:
Recall Precision F1
0 0.863777 0.96875 0.913257
In [ ]:
plot_confusion_matrix(dtree2, X_test, y_test)
In [ ]:
# find performance metrics of testing data of pre pruned tree
dtree2_test_perf = get_tree_performance(
    dtree2, X_test, y_test
)
dtree2_test_perf
Out[ ]:
Recall Precision F1
0 0.878981 0.951724 0.913907
  • The training and test scores of the pre-pruned tree are very close, indicating the model has generalized well.

Visualizing the Pre Pruned Decision Tree¶


In [ ]:
display_decision_tree(X_train,dtree2)
  • As we can see, this pre-pruned decision tree is much less complex than the fully grown tree.
In [ ]:
# printing a text report showing the rules of a decision tree
print_decision_tree(X_train,dtree2)
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2546.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- City_Santa Barbara <= 0.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [30.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- weights: [27.00, 9.00] class: 0
|   |   |   |--- City_Santa Barbara >  0.50
|   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 82.50
|   |   |   |   |--- Age <= 28.00
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  28.00
|   |   |   |   |   |--- weights: [74.00, 4.00] class: 0
|   |   |   |--- Income >  82.50
|   |   |   |   |--- CCAvg <= 3.95
|   |   |   |   |   |--- weights: [33.00, 25.00] class: 0
|   |   |   |   |--- CCAvg >  3.95
|   |   |   |   |   |--- weights: [52.00, 6.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 10.00] class: 1
|--- Income >  113.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [399.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 48.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 116.50
|   |   |   |--- Age <= 51.00
|   |   |   |   |--- weights: [6.00, 9.00] class: 1
|   |   |   |--- Age >  51.00
|   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 209.00] class: 1

  • The pre-pruned tree is not complex at all.

Decision Tree (Post-pruning)¶

In [ ]:
# Create an instance of the decision tree model
clf = DecisionTreeClassifier(random_state=42)
# Compute the cost complexity pruning path using the training data
path = clf.cost_complexity_pruning_path(X_train, y_train)
# Extract the array of effective alphas from the pruning path
ccp_alphas = abs(path.ccp_alphas)
# Extract the array of total impurities at each alpha along the pruning path
impurities = path.impurities
In [ ]:
pd.DataFrame(path)
Out[ ]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000270 0.001621
2 0.000282 0.002185
3 0.000381 0.002946
4 0.000381 0.003327
5 0.000381 0.004089
6 0.000381 0.004470
7 0.000429 0.005327
8 0.000440 0.006646
9 0.000484 0.008099
10 0.000527 0.009154
11 0.000528 0.011267
12 0.000535 0.011801
13 0.000541 0.012343
14 0.000543 0.014516
15 0.000555 0.015070
16 0.000584 0.015655
17 0.000794 0.016449
18 0.000827 0.017276
19 0.000935 0.018211
20 0.000940 0.019151
21 0.000988 0.020139
22 0.000990 0.021129
23 0.001052 0.023233
24 0.001262 0.024495
25 0.001448 0.027391
26 0.002380 0.029771
27 0.003972 0.033742
28 0.005182 0.038924
29 0.024483 0.063407
30 0.052065 0.167538
  • As the table above shows, the total leaf impurity increases monotonically as alpha increases.
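This monotone behaviour holds by construction; a minimal check on synthetic data (make_classification is just a stand-in for the bank dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the bank data
X, y = make_classification(n_samples=500, random_state=42)

# The pruning path returns alphas in increasing order, and the total leaf
# impurity can only grow as more of the tree is pruned away
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)
assert np.all(np.diff(path.ccp_alphas) >= 0)
assert np.all(np.diff(path.impurities) >= 0)
print("alphas and impurities are both non-decreasing")
```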
In [ ]:
fig, ax = plt.subplots(figsize=(10, 5))

# Plot the total impurities versus effective alphas, excluding the last value,
# using markers at each data point and connecting them with steps
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("Effective Alpha")
ax.set_ylabel("Total impurity of leaves")
ax.set_title("Total Impurity vs Effective Alpha for training set");
[Plot: Total Impurity vs Effective Alpha for training set]
  • Let's fit a tree for each effective alpha and compare them.
In [ ]:
# Initialize an empty list to store the decision tree classifiers
clfs = []

# Iterate over each ccp_alpha value extracted from cost complexity pruning path
for ccp_alpha in ccp_alphas:
    # Create an instance of the DecisionTreeClassifier
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)

    # Fit the classifier to the training data
    clf.fit(X_train, y_train)

    # Append the trained classifier to the list
    clfs.append(clf)

# Print the number of nodes in the last tree along with its ccp_alpha value
print(
    "Number of nodes in the last tree is {} with ccp_alpha {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is 1 with ccp_alpha 0.05206542558865251
  • We remove the last element of clfs and ccp_alphas, since that alpha prunes the tree down to a single node, which is not useful at all.
In [ ]:
# Remove the last classifier and corresponding ccp_alpha value from the lists
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

# Extract the number of nodes in each tree classifier
node_counts = [clf.tree_.node_count for clf in clfs]

# Extract the maximum depth of each tree classifier
depth = [clf.tree_.max_depth for clf in clfs]

# Create a figure and a set of subplots
fig, ax = plt.subplots(2, 1, figsize=(10, 7))

# Plot the number of nodes versus ccp_alphas on the first subplot
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("Alpha")
ax[0].set_ylabel("Number of nodes")
ax[0].set_title("Number of nodes vs Alpha")

# Plot the depth of tree versus ccp_alphas on the second subplot
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("Alpha")
ax[1].set_ylabel("Depth of tree")
ax[1].set_title("Depth vs Alpha")
fig.tight_layout()
[Plots: Number of nodes vs Alpha and Depth vs Alpha]
  • From the graphs above it is clear that as alpha increases, both the number of nodes and the depth of the tree decrease.
  • Let's compute F1 scores for the training and test data and plot them to check how close they are to each other.
In [ ]:
train_f1_scores = []  # Initialize an empty list to store F1 scores for training set for each decision tree classifier

# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
    # Predict labels for the training set using the current decision tree classifier
    pred_train = clf.predict(X_train)

    # Calculate the F1 score for the training set predictions compared to true labels
    f1_train = f1_score(y_train, pred_train)

    # Append the calculated F1 score to the train_f1_scores list
    train_f1_scores.append(f1_train)
In [ ]:
test_f1_scores = []  # Initialize an empty list to store F1 scores for test set for each decision tree classifier

# Iterate through each decision tree classifier in 'clfs'
for clf in clfs:
    # Predict labels for the test set using the current decision tree classifier
    pred_test = clf.predict(X_test)

    # Calculate the F1 score for the test set predictions compared to true labels
    f1_test = f1_score(y_test, pred_test)

    # Append the calculated F1 score to the test_f1_scores list
    test_f1_scores.append(f1_test)
In [ ]:
# Create a figure
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("Alpha")  # Set the label for the x-axis
ax.set_ylabel("F1 Score")  # Set the label for the y-axis
ax.set_title("F1 Score vs Alpha for training and test sets")  # Set the title of the plot

# Plot the training F1 scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, train_f1_scores, marker="o", label="training", drawstyle="steps-post")

# Plot the testing F1 scores against alpha, using circles as markers and steps-post style
ax.plot(ccp_alphas, test_f1_scores, marker="o", label="test", drawstyle="steps-post")

ax.legend();  # Add a legend to the plot
[Plot: F1 Score vs Alpha for training and test sets]
In [ ]:
# selecting the model that achieves the highest test F1 score
index_best_model = np.argmax(test_f1_scores)

# selecting the decision tree model corresponding to the highest test F1 score
dtree3 = clfs[index_best_model]
print(dtree3)
DecisionTreeClassifier(ccp_alpha=np.float64(0.001261913404200155),
                       random_state=42)
  • The best post-pruned model was selected at ccp_alpha ≈ 0.00126, not 0, so dtree3 is a genuinely pruned tree rather than the overgrown default.
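As an aside, choosing ccp_alpha by test-set F1 (as above) lets the test set influence model selection; cross-validation sidesteps this. A minimal sketch on synthetic data, not the notebook's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for X_train / y_train
X, y = make_classification(n_samples=500, random_state=42)

# Candidate alphas come from the pruning path of a fully grown tree
alphas = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y).ccp_alphas

# Pick the alpha with the best cross-validated F1 score
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"ccp_alpha": alphas},
    scoring="f1",
    cv=5,
).fit(X, y)
print("best ccp_alpha:", grid.best_params_["ccp_alpha"])
```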

Model Evaluation¶

In [ ]:
plot_confusion_matrix(dtree3, X_train, y_train)
[Confusion matrix for dtree3 on the training set]
In [ ]:
# finding out post pruned tree training data performance
dtree3_train_perf = get_tree_performance(
    dtree3, X_train, y_train
)
dtree3_train_perf
Out[ ]:
Recall Precision F1
0 0.826625 0.988889 0.900506
In [ ]:
plot_confusion_matrix(dtree3, X_test, y_test)
[Confusion matrix for dtree3 on the test set]
In [ ]:
# finding out post pruned tree testing data performance
dtree3_test_perf = get_tree_performance(
    dtree3, X_test, y_test
)
dtree3_test_perf
Out[ ]:
Recall Precision F1
0 0.866242 0.978417 0.918919
  • As we can see, the training performance of the post-pruned tree is very close to its test performance, indicating the model has generalized well.
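get_tree_performance is a helper defined earlier in the notebook; a minimal re-sketch consistent with the Recall / Precision / F1 columns shown above (an approximation, not the notebook's exact definition):

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

def get_tree_performance(model, X, y):
    """Return a one-row DataFrame with Recall, Precision and F1 for `model` on (X, y)."""
    pred = model.predict(X)
    return pd.DataFrame(
        {
            "Recall": [recall_score(y, pred)],
            "Precision": [precision_score(y, pred)],
            "F1": [f1_score(y, pred)],
        }
    )
```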

Visualizing Decision Tree¶

In [ ]:
display_decision_tree(X_train,dtree3)
[Decision tree visualization for dtree3]
  • Compared to the default overgrown tree, the post-pruned tree is quite compact, with only a few levels of splits.
In [ ]:
print_decision_tree(X_train,dtree3)
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2603.00, 11.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Income <= 82.50
|   |   |   |   |--- weights: [74.00, 5.00] class: 0
|   |   |   |--- Income >  82.50
|   |   |   |   |--- CCAvg <= 3.95
|   |   |   |   |   |--- weights: [33.00, 25.00] class: 0
|   |   |   |   |--- CCAvg >  3.95
|   |   |   |   |   |--- weights: [52.00, 6.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [3.00, 10.00] class: 1
|--- Income >  113.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [399.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 48.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 116.50
|   |   |   |--- weights: [13.00, 9.00] class: 0
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 209.00] class: 1
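The printed rules above are compact enough to transcribe by hand. A direct Python translation (for illustration only; predict_loan is a hand-written stand-in, and the fitted dtree3 remains the model of record):

```python
def predict_loan(income, ccavg, cd_account, education, family):
    """Hand-transcribed decision rules of the post-pruned tree.

    income and ccavg are in thousands of dollars, cd_account is 0/1,
    and the return value is 1 if the customer is predicted to accept.
    """
    if income <= 113.5:
        if ccavg <= 2.95:
            return 0
        # CCAvg > 2.95: only customers holding a CD account are predicted to accept
        return 1 if cd_account > 0.5 else 0
    if education <= 1.5:
        # High income, undergraduate education: family size decides
        return 1 if family > 2.5 else 0
    # High income, graduate or professional education
    return 0 if income <= 116.5 else 1

# A high-income graduate customer is predicted to accept
print(predict_loan(income=150, ccavg=1.0, cd_account=0, education=2, family=2))  # prints 1
```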

Model Performance Comparison and Final Model Selection¶

In [ ]:
# training performance comparison
models_train_comp_df = pd.concat(
    [
        dtree1_train_perf.T,
        dtree2_train_perf.T,
        dtree3_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Recall 1.0 0.863777 0.826625
Precision 1.0 0.968750 0.988889
F1 1.0 0.913257 0.900506
In [ ]:
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        dtree1_test_perf.T,
        dtree2_test_perf.T,
        dtree3_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[ ]:
Decision Tree (sklearn default) Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Recall 0.872611 0.878981 0.866242
Precision 0.913333 0.951724 0.978417
F1 0.892508 0.913907 0.918919
  • Both the pre-pruned and post-pruned decision trees generalize well, with training and test performance close to each other.

  • The post-pruned decision tree performs slightly better on the test set (F1 ≈ 0.92) than on the training set, so there is no sign of overfitting.

    • This model splits on only a handful of features (Income, CCAvg, CD_Account, Education, Family), which keeps prediction fast.
  • The pre-pruned decision tree has almost the same performance on the training and test sets, and the highest test recall of the three models.

    • A tree with fewer splits is less likely to have memorized quirks of the training data, so it should hold up well on unseen customers.
  • Since one of the objectives is to identify which segment of customers to target more, a simpler tree than an overgrown tree will be able to achieve this objective in a better way.

  • We'll move ahead with the pre-pruned decision tree as our final model.

Feature Importance¶

In [ ]:
# importance of features in the tree building
# (no slicing, so the importance values stay aligned with the column names)
importances = dtree2.feature_importances_
indices = np.argsort(importances)
feature_names = X_train.columns
plt.figure(figsize=(5, 5))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
[Bar chart: Feature Importances]
  • From the plot above it is clear that Income plays the most prominent role in this prediction model.
  • It is followed by Family, CCAvg and Education.
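A convenient way to inspect importances without a plot is to pair them with the column names in a sorted Series; a sketch on synthetic data (stand-ins for the notebook's X_train and dtree2):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the notebook's X_train and dtree2
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=1)
columns = [f"feature_{i}" for i in range(6)]
tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Importances sum to 1; sorting gives the same ranking the bar chart shows visually
ranked = pd.Series(tree.feature_importances_, index=columns).sort_values(ascending=False)
print(ranked)
```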

Predicting on a single data point¶

In [ ]:
%%time
# choosing a data point
applicant_details = X_test.iloc[1:2, :]

# making a prediction
approval_prediction = dtree2.predict(applicant_details)
print(approval_prediction)
[1]
CPU times: user 3.51 ms, sys: 0 ns, total: 3.51 ms
Wall time: 3.59 ms
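The %%time figure can be reproduced outside a notebook with time.perf_counter; a sketch on a stand-in tree (not the notebook's fitted dtree2):

```python
import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the fitted model and a single applicant row
X, y = make_classification(n_samples=1000, random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

start = time.perf_counter()
tree.predict(X[:1])  # predict on a single data point
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"single-point prediction took {elapsed_ms:.2f} ms")
```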
  • The model produced a prediction in a few milliseconds (about 3.6 ms of wall time).
  • Instead of predicting if a customer will buy a personal loan or not, the model can also predict how likely it is that the customer will buy a personal loan.
In [ ]:
# making a probability prediction (note: this cell uses dtree3; the final model dtree2 exposes the same predict_proba API)
approval_likelihood = dtree3.predict_proba(applicant_details)
print(approval_likelihood[0, 1])
1.0
  • This indicates that the model is ~100% confident that the customer will accept a personal loan offer; for a decision tree, predict_proba simply returns the class fractions in the leaf the applicant falls into.

Conclusion and Recommendations¶

Conclusions:¶

  • Income plays the most prominent role in predicting whether a customer will accept a personal loan offer.
  • It is followed by Family, CCAvg and Education.
  • The model produces a prediction within a few milliseconds, which is fast enough for interactive screening.
  • Instead of predicting whether a customer will buy a personal loan, the model can also estimate how likely the customer is to buy one.

Recommendations:¶

  • AllLife bank can deploy this model for the initial screening of customers to be targeted for marketing personal loans.
  • Instead of outputting if a customer will accept a personal loan offer or not, the model can be made to output the likelihood of acceptance of a personal loan offer by a customer.
  • In case the likelihood of acceptance of a personal loan offer by a customer is above a certain threshold, say 80%, then spending more effort on marketing for that customer is recommended.
  • Also, as Income, Family, CCAvg and Education play the most prominent roles in predicting whether a customer will accept a personal loan offer, these customer attributes should be taken into consideration when deciding whom to target.
  • This will reduce the overall time for identifying potential personal loan customers for the marketing department.
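The threshold recommendation above can be sketched as a small helper (select_targets is a hypothetical name, and 0.8 mirrors the suggested cutoff):

```python
import numpy as np

def select_targets(probabilities, threshold=0.8):
    """Return the indices of customers whose predicted acceptance
    likelihood meets or exceeds the threshold."""
    return np.flatnonzero(np.asarray(probabilities) >= threshold)

# Example scores, as would come from model.predict_proba(X)[:, 1]
scores = [0.95, 0.40, 0.82, 0.10, 0.79]
print(select_targets(scores))  # prints [0 2]
```

The marketing team would then concentrate outreach effort on the returned indices.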